
 receptive field


Focal Modulation Networks

Neural Information Processing Systems

We propose focal modulation networks (FocalNets for short), where self-attention (SA) is completely replaced by a focal modulation module for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its content, and (iii) element-wise modulation or affine transformation to fuse the aggregated context into the query. Extensive experiments show FocalNets outperform the state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) with similar computational cost on the tasks of image classification, object detection, and semantic segmentation. Specifically, FocalNets with tiny and base size achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K.
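The three components listed in the abstract can be illustrated with a toy, single-channel, 1D sketch. This is not the FocalNets implementation: the function names, the fixed kernels, the ReLU activation, the sigmoid gate, and the scalar weights `gate_w`, `q_w`, and `h_w` are all assumptions chosen for readability.

```python
import math


def conv1d_same(x, kernel):
    """Zero-padded 'same' 1D cross-correlation (odd-length kernel assumed)."""
    r = len(kernel) // 2
    n = len(x)
    return [
        sum(kernel[j] * x[i + j - r]
            for j in range(len(kernel)) if 0 <= i + j - r < n)
        for i in range(n)
    ]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def focal_modulation_1d(x, kernels, gate_w, q_w=1.0, h_w=1.0):
    """Toy focal modulation over a list of scalar tokens `x`.

    kernels: one convolution kernel per focal level (hypothetical values).
    gate_w:  one scalar gate weight per level (hypothetical values).
    """
    z = list(x)
    aggregated = [0.0] * len(x)
    for level, kernel in enumerate(kernels):
        # (i) hierarchical contextualization: stacking convolutions widens
        # the context seen at each level (ReLU is a stand-in activation).
        z = [max(0.0, v) for v in conv1d_same(z, kernel)]
        # (ii) gated aggregation: each token weights this level's context
        # by a gate computed from its own content.
        for i, token in enumerate(x):
            aggregated[i] += sigmoid(gate_w[level] * token) * z[i]
    # (iii) element-wise modulation: the query is multiplied by the
    # (projected) aggregated context.
    return [(q_w * token) * (h_w * ctx) for token, ctx in zip(x, aggregated)]


tokens = [0.5, 1.0, -0.2, 0.8]
box = [1 / 3, 1 / 3, 1 / 3]
out = focal_modulation_1d(tokens, kernels=[box, box], gate_w=[1.0, 0.5])
```

Because the two kernels are applied in sequence, the second level sees a five-token neighbourhood while the first sees three, which is the short-to-long range behaviour the abstract describes.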





MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning

Neural Information Processing Systems

Tiny deep learning on microcontroller units (MCUs) is challenging due to the limited memory size. We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs: the first several blocks have an order of magnitude larger memory usage than the rest of the network. To alleviate this issue, we propose a generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory. However, a naive implementation introduces overlapping patches and computation overhead. We further propose receptive field redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead. Manually redistributing the receptive field is difficult.
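Why patch-by-patch inference creates overlap can be seen from standard receptive-field arithmetic. The sketch below is not MCUNetV2 code; the function names and the example layer stack are hypothetical, padding is ignored, and the ratio of input pixels read is used only as a rough proxy for the computation overhead.

```python
def input_extent(out_extent, layers):
    """Input width needed to produce `out_extent` outputs through `layers`.

    layers: list of (kernel_size, stride) pairs, first layer first,
    with no padding (an assumption made to keep the arithmetic simple).
    """
    extent = out_extent
    for kernel, stride in reversed(layers):
        extent = (extent - 1) * stride + kernel
    return extent


def patch_overhead(out_extent, patches_per_side, layers):
    """Input pixels read with square patches, relative to one full pass."""
    if out_extent % patches_per_side:
        raise ValueError("out_extent must be divisible by patches_per_side")
    per_patch = input_extent(out_extent // patches_per_side, layers)
    full = input_extent(out_extent, layers)
    return (patches_per_side * per_patch) ** 2 / full ** 2


# Hypothetical early stage: 3x3 stride-2, 3x3 stride-1, 3x3 stride-2.
stage = [(3, 2), (3, 1), (3, 2)]
overhead = patch_overhead(8, 2, stage)  # greater than 1: patches overlap
```

A larger receptive field in the patched stage makes each patch's input halo wider, so moving receptive field to later, non-patched layers shrinks this ratio.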


[Figure: Accuracy [%], Elastic Transform]

Neural Information Processing Systems

Here we compute the mean and standard deviation across seeds.

Table 3: Comparing our MTL model co-trained on predicted neural responses (MTL-Monkey in the paper) to the MTL model co-trained directly on real monkey V1 responses.

Model                                                   Robustness score
Baseline                                                100%
MTL with real responses                                 109%
MTL with predicted responses (MTL-Monkey)               118%
MTL with shuffled predicted responses (MTL-Shuffled)    98%

We computed the robustness score of each model after averaging the accuracies of 3 seeds per model for each corruption type in TIN-TC and normalizing against the baseline test accuracies, i.e. the baseline score is 100%. We find that we can obtain a general increase in robustness when using real neural data. However, co-training on predicted neural responses improves the robustness of the models even more.
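The normalization described above can be sketched as follows. The text specifies averaging over seeds per corruption type and normalizing against the baseline; combining corruption types by a plain mean of the per-corruption ratios is an assumption here, and the function name, corruption names, and accuracies are hypothetical placeholders rather than values from the paper.

```python
from statistics import mean


def robustness_score(model_acc, baseline_acc):
    """Robustness score in percent, with the baseline at 100%.

    model_acc / baseline_acc: dict mapping corruption type to a list of
    per-seed accuracies.
    """
    ratios = []
    for corruption, seed_accs in model_acc.items():
        # Average over seeds first, then normalize against the baseline.
        ratios.append(mean(seed_accs) / mean(baseline_acc[corruption]))
    # Assumption: corruption types are combined with an unweighted mean.
    return 100.0 * mean(ratios)


# Hypothetical accuracies (3 seeds per corruption type), for illustration only.
baseline = {"elastic": [20.0, 22.0, 21.0], "noise": [10.0, 10.0, 10.0]}
model = {"elastic": [25.2, 25.2, 25.2], "noise": [11.0, 11.0, 11.0]}
score = robustness_score(model, baseline)
```

By construction, scoring the baseline against itself gives 100, and a score above 100 means the model is on average more accurate than the baseline under corruption.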


5 Supplementary Material

Neural Information Processing Systems

Dendritic updates. Complete versions of the dendritic update rules (summarised in Eqns (2) & (3)) are given below. This is valid in our regime where the environmental latent updates slowly compared to neural timescales. The notation we're using admits the possible presence of biases as well as weights (though biases typically aren't used): a row of constant 1's can be added to the synaptic inputs, effectively absorbing a bias into the weight matrix without loss of generality, for example $w_{g_B} p(t) \to w_{g_B} p(t) + b_{g_B}$.

Somatic updates. The somatic update rules (Eqns (4) & (5)) are repeated here for completeness:

$p(t) = \theta(t)\, p_B(t) + (1 - \theta(t))\, p_A(t)$
$g(t) = \theta(t)\, g_B(t) + (1 - \theta(t))\, g_A(t)$

Update ordering. For this hierarchical network of multicompartmental neurons we must specify the order in which we perform these discrete updates to the different layers and the different compartments within these layers.
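The somatic update has the form of a convex combination of the two dendritic compartments. A minimal sketch, assuming the time-varying coefficient (named `theta` here) lies in [0, 1] and that compartment activities are plain lists of floats; the function name and example values are hypothetical:

```python
def somatic_update(theta, basal, apical):
    """Mix basal and apical compartment activities into the soma.

    theta: gating coefficient in [0, 1] (the name is an assumption).
    basal, apical: equal-length lists of compartment activities.
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    if len(basal) != len(apical):
        raise ValueError("compartments must have the same size")
    return [theta * b + (1.0 - theta) * a for b, a in zip(basal, apical)]


p_basal = [1.0, 2.0]
p_apical = [5.0, 6.0]
p_soma = somatic_update(0.5, p_basal, p_apical)
```

At `theta = 1` the soma copies the basal compartment, and at `theta = 0` it copies the apical compartment.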




START: A Generalized State Space Model with Saliency-Driven Token-Aware Transformation

Neural Information Processing Systems

Domain Generalization (DG) aims to enable models to generalize to unseen target domains by learning from multiple source domains. Existing DG methods primarily rely on convolutional neural networks (CNNs), which inherently learn texture biases due to their limited receptive fields, making them prone to overfitting source domains. While some works have introduced transformer-based methods (ViTs) for DG to leverage the global receptive field, these methods incur high computational costs due to the quadratic complexity of self-attention. Recently, advanced state space models (SSMs), represented by Mamba, have shown promising results in supervised learning tasks by achieving linear complexity in sequence length during training and fast RNN-like computation during inference. Inspired by this, we investigate the generalization ability of the Mamba model under domain shifts and find that input-dependent matrices within SSMs could accumulate and amplify domain-specific features, thus hindering model generalization. To address this issue, we propose a novel SSM-based architecture with saliency-based token-aware transformation (namely START), which achieves state-of-the-art (SOTA) performance and offers a competitive alternative to CNNs and ViTs. START can selectively perturb and suppress domain-specific features in salient tokens within the input-dependent matrices of SSMs, thus effectively reducing the discrepancy between different domains. Extensive experiments on five benchmarks demonstrate that START outperforms existing SOTA DG methods with efficient linear complexity.